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Abstract:  Studies  of  probabilistic  modeling  and  simulation  focused  on  some  special 
situations  and  behaviors,  including  rare  events  and  scenarios,  occurred  in  large  scale  and 
complex  systems  have  been  performed  extensively.  Flowever,  few  studies  have  explored  the 
modeling  and  simulation  techniques  to  seamlessly  cover  multiple  scopes  including  both 
major  (frequent)  and  minor  (rare)  situations  and  behaviors  embedded  in  a  given  large  data 
set.  A  main  obstacle  to  develop  such  techniques  comes  from  the  difficulty  to  capture  the 
situations  and  the  behaviors  having  highly  contrasted  probabilities  in  a  unique  model  of  the 
data  distribution.  Two  technical  issues  must  be  addressed  for  overcoming  this  obstacle;  (a) 
weighting  instances  in  a  given  large  data  set  and  (b)  selecting  prototypes  from  a  given  large 
data  set.  Particularly,  the  latter  is  for  the  modeling  from  massive  data  to  which  the  thorough 
access  is  not  feasible.  A  method  to  address  these  two  issues  preserves  the  variety  of 
instance  distributions  of  the  data  set  and  provide  the  basis  of  the  seamless  simulations  over 
the  multiple  scopes.  In  the  first  year,  we  performed  a  mathematical  analysis  to  derive  the 
required  conditions  on  our  targeting  method  and  designed  a  principle  of  the  method  which 
largely  alleviate  the  obstacle  by  efficiently  sampling  the  required  prototypes  with  their 
appropriate  weights.  In  the  second  year,  we  implemented  the  designed  principle  of  the 
targeting  method  to  an  algorithm  in  computers  and  evaluated  its  generic  performance 
preserving  the  variety  of  instance  distributions  of  the  data  set  in  the  selected  prototypes. 
The  mathematical  analysis,  the  designed  principle,  the  implemented  algorithm  and  its 
performance  evaluation  presented  in  this  report  is  the  first  work  in  worldwide  for  the 
seamless  and  comprehensive  probabilistic  simulations  of  the  large  scale  and  complex 
systems. 

Introduction:  Needs  of  probabilistic  simulations  of  large  scale  and  complex  systems  are 
rapidly  increasing.  Various  situations  and  behaviors  occur  under  very  special  conditions  with 
some  low  probability  in  such  systems  because  of  their  huge  and  complex  structures.  For 
example,  a  particular  protein  folding  structure  is  known  to  occur  under  very  limited 
conditions.  A  giant  typhoon  is  also  induced  by  special  conditions  consisting  of  various 
meteorological  factors. 

Accurate  probabilistic  simulations  of  all  situations  and  behaviors  of  the  large  scale  and 
complex  systems  are  not  usually  tractable,  because  the  systems  and  their  behaviors  are  too 
complex  to  be  well  modeled  and  computed  by  using  our  background  knowledge.  State  of  the 
art  to  overcome  this  difficulty  is  to  combine  a  divide  and  conquer  approach  and  a 
data-driven  approach.  The  divide  and  conquer  approach  limits  the  scope  of  the  modeling 
and  the  simulations  to  our  interested  situations  and  behaviors.  For  example,  we  focus  the 
simulations  on  a  protein  folding  structure  by  limiting  biological  and  physical  conditions  to 
make  it  occur.  For  the  giant  typhoon,  we  also  limit  the  simulation  to  some  specific  rare 


DISTRIBUTION  A.  Approved  for  public  release:  distribution  unlimited. 


conditions.  The  data-driven  approach  introduces  more  empirical  modeling  based  on  the 
observed  data.  For  example,  probabilistic  free  energy  models  of  the  protein  molecule  include 
many  empirical  parameters.  Some  parts  of  a  probabilistic  typhoon  model  are  also  derived  by 
empirically  observed  data.  This  framework  reduces  the  difficulty  of  the  system  modeling  and 
achieves  the  sufficient  accuracy  in  the  simulations  with  their  tractability. 

However,  the  divide  and  conquer  approach  provides  the  models  and  the  simulations 
fragmented  into  the  individual  scope,  and  their  interpretability  are  limited.  This  drawback  of 
the  current  framework  also  reduces  its  applicability  to  practical  problems  including  analysis, 
control  and  management  of  the  systems  across  their  multiple  scopes.  These  difficulties  will 
become  more  significant  in  near  future,  since  the  scale  and  the  complexity  of  the  systems 
and  the  problems  are  significantly  increasing  in  our  modern  society.  On  the  other  hand, 
amount  of  data  observed  from  the  systems  is  rapidly  increasing,  and  this  provides  "big  data". 
Since  the  big  data  is  acquired  under  various  situations  and  behaviors  of  the  systems,  it  tends 
to  cover  their  many  scopes.  Accordingly,  we  may  obtain  better  models  for  the  simulations  in 
a  data-driven  manner.  However,  such  models  are  not  easily  derived,  since  thorough  access 
to  all  instances  in  the  big  data  is  not  tractable. 

Aim  and  Goals 

In  this  study,  we  aim  to  overcome  the  drawbacks  of  the  current  probabilistic  simulation 
framework  and  also  to  address  the  issue  of  the  modeling  which  uses  the  big  data.  We  target 
the  following  research  goals  under  these  aims: 

(1)  We  investigate  the  mathematical  principle  for  the  data-driven  probabilistic  modeling  to 
capture  variety  of  the  instance  distribution  in  a  given  data  set  for  covering  multiple 
scopes  of  our  objective  system  in  a  seamless  manner. 

(2)  Based  on  the  investigated  mathematical  principle,  we  characterize  the  required 
prototype  distributions  in  a  subsample  data  set  selected  from  the  given  data  set  to 
provide  such  a  seamless  probabilistic  model. 

(3)  We  develop  an  instance  weighing  and  sub-sampling  algorithm  for  the  prototype 
selection  preserving  the  variety  of  the  instance  distribution  in  a  large  data  set.  This 
enables  highly  tractable  probabilistic  modeling  of  the  objective  system  over  its  multiple 
scopes  by  efficiently  extracting  instances  representing  each  scope  of  the  system  from 
the  big  data. 

(4)  Performance  of  the  developed  method  for  the  instance  weighing  and  the  prototype 
selection  is  evaluated  and  confirmed  through  some  simulation  examples. 

In  the  first  year,  we  theoretically  investigated  the  goals  (1)  and  (2).  For  the  goal  (1), 
we  introduced  our  problem  setting  at  first,  and  second,  we  defined  some  measures  to 
characterize  the  optimal  subsample  distributions,  and  third,  we  seek  some  mathematical 
analysis  method  for  the  characterization.  For  the  goal  (2),  by  applying  the  mathematical 
analysis  method,  we  characterize  the  required  distributions  of  instances  in  a  subsample  data 
set  of  a  given  original  large  data  set.  The  instances  having  the  distributions  in  the  subsample 
data  set  minimize  these  measures  and  preserve  the  varieties  of  the  distribution  of  the 
original  data  set  in  a  compact  fashion  by  the  nature  of  these  measures. 

In  the  final  year,  we  worked  on  the  goals  (3)  and  (4).  For  the  goal  (3),  an  algorithm 
which  efficiently  subsamples  the  instances  as  prototypes  by  following  the  required 
distribution  from  the  large  original  data  set  while  minimizing  the  measures  were  designed 
based  on  a  principle  of  the  instance  weighting.  Then,  for  the  goal  (4),  the  basic  performance 
of  the  designed  algorithm  was  evaluated  through  some  generic  simulations,  and  its 
promising  performance  was  confirmed. 

Problem  Setting:  The  population  distribution  of  a  very  large  data  set  is  denoted  as  f(x), 
where  x  is  a  continuous  variable  on  H,  and  the  support  of  f(x)  is  limited  to  [xmin,  xmax].  Let 
X  =  {xi,i  =  l,---,N}  with  very  large  N  be  samples  independently  generated  from  f(x) . 
An  extra  auxiliary  distribution  is  defined  as  g(x)  ,  and  another  subsample  set 
Y  =  {3cj, — ,3c„}  ( n«N )  is  independently  sampled  from  X  according  to  their  sampling 
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weights  7t{xi)  =  \ I  coixf)  making  the  population  distribution  of  Y  be  g{x) .  This  <y(x;.)  is 
given  by 


a>(xi)  = 


fix,) 

8ix,)  ' 


(1) 


In  Eg.  (1),  we  assume  g(x)*  0  whenever  f(x)±  0. 


Let  an  estimator  of  f(x) :  f(x\  gix),X,Y)  is  the  following  weighted  Kernel  density 
estimator  using  Y  [1], 


H*W).X.Y)  =  =  i-£  ^  K("  “  ' 


h  nh  i=1  g(x,) 


h 


)• 


(2) 


Our  problem  is  to  derive  an  optimum  auxiliary  distribution  g  t(x),  which  minimizes  a  given 


error  measure  M(g,X,Y)  between  f(x)  and  f(x  I  g(x),X,  Y)  under  the  data  set  X 
independently  sampled  from  f(x).  Note  that  g  ( x)  derived  here  defines  the  optimum 
co(x ',)  by  Eq.(l).  Our  further  problem  is  to  design  efficient  algorithm  for  subsampling  Y 
from  X  according  to  the  weight  x(x,)  =  l/o(3c,-)  . 


Measures:  Our  candidate  measures  M(g,X,Y )  chosen  for  preserving  the  varieties  of 
the  distribution  of  X  in  Y  are  Mean  Integrated  Square  Percentage  Error  (MISPE)  and  Alpha 
Divergence. 


MISPE 

The  simplest  candidate  measure  is  the  following  Mean  Integrated  Square  Percentage  Error 
(MISPE)  [2], 


MISPE[f(x)f(x\g(x),X,Y)]  =  E 


r  f(x)~  f(x\g(x),X,Y)V 


fix) 


dx 


J 


=  E 


'fix)-fjx\gjx),X,  Y)V 
fix) 


(3) 


dx 


Because  f(x)  in  the  denominator  weights  the  error  between  f(x)  and  f(x\  gix),X,  Y) , 
fix\ gix),X, Y)  minimizing  this  measure  reflects  f(x)  more  when  it  is  smaller.  Hence, 


the  resultant  Y  and  its  coixf  is  supposed  to  capture  the  varieties  of  /(x) . 


Alpha  Divergence 

A  divergence  has  been  considered  as  a  dissimilarity  measure.  Some  of  its  properties  [3] 
allow  us  to  minimize  alpha-divergence  to  find  the  best  approximating  distribution.  Firstly, 
alpha-divergence  is  zero  if  f  —  f  and  positive  otherwise,  so  it  satisfies  the  basic  property 
of  an  error  measure.  The  property  follows  from  the  fact  that  alpha-divergences  are  convex 
with  respect  to  /  and  f  [4],  The  alpha-divergence  being  used  as  an  error  measure  has 
the  variant  structure  with  different  selection  of  parameter  a .  For  example,  the  case  with 
a  =  0.5  is  known  as  Hellinger  distance.  The  case  with  a  =  -l  is  considered  as  a  measure 
similar  to  the  mean  integrated  squared  percentage  error  (MISPE).  The  case  with  a—>0  or 
1  is  defined  as  KL-divergence. 
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Let  an  alpha-divergence  D(a\f(x)\f(x\g(x),X,Y))  be  an  error  measure 

between  f(x)  and  f(x\  g(x),X,  Y) .  In  our  analysis,  we  design  g(x)  to  minimize  this 

divergence  measure.  The  original  definition  of  the  alpha-divergence  is 

,  .  \\fa{x)fl-a(x\g(x),X,Y)-qf{x)  +  {cc-\)f{x\g{x),X,Y))dx  ... 

D'a)(f  (x)  I  f(x\g(x),X,Y))  =  ^ - - - ,  a  *  0,1. ^ 

a(a- 1) 

The  case  with  0  or  1  is  defined  as  KL-divergence,  which  is  given  by 
lim Dw(f(x)  I  /(x  I  g(x),X,Y))  =  KL(f  (x  I  g(x),X,Y)  I  /(x))  (5) 

a— >0 


lim  D{a\f{x)  I  f{x  I  g(x),X,Y))  =  KLifix )  I  fix  I  g(x),X,Y))  (6) 

a->l 


Eg.  (4)  can  be  reformulated  to  a  simpler  expression  as 


D(a)(f(x)lf(xlg(x),X,Y)) 


1+f  fa(x)fla(x\g(x),X,Y)dx 

J — CO 

a{a  - 1) 


,  a  *  0,1,  fix)  *  0. 


(7) 


In  Eq.  (7),  f(x)  causes  singularity  of  alpha  divergence  when  /(x)  =  0.  Therefore,  without 
loss  of  generality,  we  exclude  the  area  of  /(x)  =  0  from  the  integral  and  assume  /(x)^0  in 

the  following  analysis.  f(x)  has  the  similar  effect  on  the  alpha  divergence,  and  we  assume 
that  /(x)^0  whenever  /(x)^0. 


Calculus  of  Constrained  Variations  and  Characterization  of  Optimal  Auxiliary 
Distributions:  For  deriving  coixf  of  the  instances  in  the  subsample  data  set  using  Eq.(l), 

we  need  to  know  the  optimum  auxiliary  distribution  go  fx)  minimizing  M ig,X ,Y) .  Since 
g  ix)  is  a  probability  density  function  which  integral  over  the  entire  R  is  unity,  this 
optimization  problem  of  gix )  with  its  integral  constraints  is  generally  falls  into  the  following 
"Calculus  of  Constrained  Variations”  [5]  and  is  represented  as 


rb 

Minimize :  J  =  L(x,  y(x),  y\x))dx, 

Ja 

rb 

Subject  to  :  I  =  Gix,  y(x),  y\x))dx, 

Ja 


(8) 


where  y(x)  is  to  be  optimized,  and  /  is  a  known  constant.  To  solve  this  problem, 
L(x,y(x),y'(x))  is  extended  to  L[x,y[x),y'[x),X)  which  includes  the  Lagrange  multiplier. 

L(x,  y(x),  y'(x))  =  L(x,  y(x),  v'(x))  +  2{/  -G(x,  v(x),  v'(x))},  (9) 

Where  X  is  a  constant.  Then,  the  optimization  problem  of  L(x,  v(x),  y'(x))  is  transformed 

to  the  following  standard  "Calculus  of  Variations"  of  the  extended  L(x,y(x),y\x),A)  without 
any  constraint  [6]  and  [7], 

rb  — 

Minimize:/ =  L(x,  yix),  y\x))dx.  (10) 

Ja 

The  optimium  y(x)  is  known  to  be  the  solution  of  the  following  partial  differential  equation. 


dL 

dyix) 

In  case  of  MISPE  (See  Appendix  A), 


_d_ 

dx 


dL 

\dy\x)  j 


0. 


(ID 
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L(x,  g(x))  =  MISPE[f(x)\f(x  I  g(x),X,  Y)]  =  f 

1  J  00 


/(*)-/(*  I  g(*),X,Y) 
/(*) 


dx 


/•co 

with  the  integral  constraint  g(x)dx  =  1  is  to  be  minimized.  Thus,  by  formulating 

J— oo 

L(x,g(x))  =  L(x,g(x))  +  A( 1-f  g(x)dx) ,  (12) 

J— 00 


we  solve  Eq.(ll).  This  results  the  following  optimum  g(x). 

1 


8  «*(*)  =  ■ 


I-Zl. 

Anh 


0 


Xef-Xmin’Xma J. 

othem’ise. 


I  n  case  of  the  alpha-divergence  (See  Appendix  B), 

f{l  —  J  /(a)(x)/1_a(xl g(x),X,F)dx}/{a(a-l)}  fora^O,!, 


L(x,g(x))  =  (/(x)  I  f(x I  *(*), X.T))  =  \  KL(f(x)  \f(x\  g(x), X , Y)) 

KL(f(x\g(x),X,Y)\f(x)) 


fora  — >1, 
for  a  ->  0, 


(13) 


(14) 


poo 

with  the  integral  constraint  g(x)dx  =  1  is  to  be  minimized.  By  applying  Eq.(9)  and 

J— oo 

Eq.(ll),  we  obtained  the  following  solutions. 


(A)  Case:  a  *0,1,2 


i 

[exp(M(x))^(x)]“-1 

§opAX )  1  ’ 

/•00  - 

[exp  (u(x))S(x)]a~ldx 

J-oo 


(15) 


where  u(x)  =  orln(/(x))  + 


a  K'( 0) 
h  K( 0) 


J 


dx  and  S(x) 


rexp  j-u(x)) 
'  (x-Fa[X]) 


(B)  Case:  a  =  2 

[exp(«(x))^(x)] 

S0pAx'  ’ 

[exp(w(.x))c>(.x)]d.x 

J-oo 


(16) 


where  u(x)  =  2ln(/(x))  +  2ln(x-£s[X]) 


2  X'(0) 
h  K( 0) 


jdx  and  £(x)  =  J 


exp(-u(x)) 

(x-Fa[X]) 


(C)  Case:  or  — >  1  (The  solution  of  a  — >  0  is  similarly  obtained.) 


8oP,(x) 


[exp(w(x))<J(x)]  1 
[exp(w(x))£>(x)]  1  dx 


(17) 


where  u(x)  =  ln(x-£g[X])+C  and  S(x)  = 


fexp(-u(x))  1 
J  (x-Efl[X]j  f(x) 
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Algorithm  to  Select  Prototypes:  The  optimal  auxiliary  distributions  gopl(x)  given  by  the 

alpha  divergence  depends  on  f(X),  and  thus  it  is  not  easily  computed  because  f(X)  is 
unknown.  In  contrast,  gopt(x )  based  on  the  MISPE  measure  is  uniquly  given  as  the  uniform 

distribution  shown  in  Eq.(13).  Accordingly,  we  focus  on  this  simpler  case  in  our  further  study. 
The  sampling  algorithm  of  Y  from  X  must  be  designed  to  sample  each  instance  from  X  which 
follows  f(x)  over  R  with  a  uniform  probability  density  got(x )  over  R  by  applying  the 

sampling  weight  =  l/ry(.T;.)  designated  by  Eq.(l),  so  that  the  samples  in  Y  drawn 

from  X  are  uniformly  distributed  everywhere  in  the  support  of  f(x). 

An  issues  to  design  this  algorithm  is  the  lackness  of  the  infornation  on  true  f(x)  which 
is  needed  to  directly  compute  correct  7i(xt)  =  H coix^  .  We  introduce  an  iterative 
approximation  alogorithm  named  "Wang-Landau  algorithm"  [8]  to  derive  the  subsample  set 
Y  following  the  uniform  distribution  gopt{^i)  ■  The  key  idea  of  this  algorithm  is  to  use  a 

histogram  of  the  prototype  set  Y  over  [xmm,xmax]  and  gradually  modify  =  1/ <»(.*,.) 

at  each  bin  of  the  histogram  in  iterative  computations.  Let  X  j  cz  [xmin,xmax\ 
(j  =  1  be  a  bin  of  the  histogram  of  the  subsample  data  set  Y.  The  histogram  consists 
of  B  bins  partitioning  [xmm,xmax] .  We  define  a  weight  of  each  bin  as  n(X .) ,  and  let  the 


weight  of  each  x,  in  Y  be  ^(xt)  =  n(X ;.)  subject  to  xi&Xj. 

Initially,  we  give  a  tentative  arbitrary  weight  value  to  n{X j)  of  each  bin.  This  is 
equivalent  to  assume  some  arbitrary  density  f(xt)  for  xi  e  X  j  as 

/(5c,.)  =  <?opr(X|)  =  gop,(Xi)  =  where  c  = - - - .  (18) 

*<?,)  *( Xj ) 

After  each  trial  to  draw  x  from  X,  we  select  x  into  Y  as  xi  in  probability  which  is 


proportional  to  the  weight  x(x)  =  7r(Xj)  subject  to  xeXj  as 

fiiXjMxj 

PW  =  ~B - ' 

Y^KXj)? t(X.) 

7=1 


(19) 


where  h(X  .)  is  a  frequency  of  the  instances  included  in  a  bin  X  j . 

Subsequently,  /t{X  .)  is  lessen  by  multiplying  a  constant  factor  0  <F  <  1  to  it.  This 

procedure  reduces  the  weight  of  the  instance  x  belonging  to  a  bin  having  a  large 
frequency  in  the  histogram  of  Y,  i.e.,  the  instance  x  frequently  selected  into  Y  from  X.  The 
instances  more  frequently  sampled  in  Y  become  less  sampled  by  their  lowered  weights,  and 
the  instances  less  frequently  sampled  in  Y  become  more  sampled  by  their  relatively  lifted 
weights.  Accordingly,  this  procedure  has  an  effcet  to  flatten  the  histogram's  shape. 

These  instance  selection  and  weight  update  are  repeated  and  then  posed  when  the 
histogram  becomes  almost  flat.  At  this  point,  Y  and  the  frequency  h(X  .)  of  its  all 

histogram  bins  are  reset  to  empty  while  keeping  their  latest  weights  n (X  .).  In  addition, 

the  factor  F  is  changed  by  F  <—  4~F  to  make  it  closer  to  1.  This  updated  factor  F  enables 
finer  tuning  of  the  weights  to  more  precisely  flatten  the  histogram.  Then,  the  instance 
selection  into  Y,  the  weight  update  and  the  factor  update  are  further  repeated  until  we 
obtain  Y  having  a  sufficiently  flattened  histogram,  i.e.,  a  uniform  distribution. 
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This  Wang-Landau  algorithm  indirectly  reflects  the  population  distribution  of  X,  f(X), 
to  the  weights  of  the  histogram  bins,  i.e.,  the  weights  of  A{xi)  =  A{Xj)  subject  to 

xi  eXj ,  by  iteratively  modify  them  while  computing  the  frequencies  of  the  bins,  and 

achieves  the  sufficiently  uniform  gopt(Xi)  °f  Y.  A  strong  advantage  of  this  algorithm  is  that 

we  can  efficiently  derive  the  prototype  set  Y  having  a  sufficiently  uniform  distribution 
without  assessing  entire  distribution  of  X,  i.e.,  accessing  the  entire  data  set  X  which  can  be 
very  huge. 

The  following  is  a  pseudo-code  of  this  algorithm.  Here,  we  use  logarithmic  weights 
LP(X j)  =  logft(Xj)  in  place  of  jt(X j)  ,  and  further  we  define  logarithmic  factor 

LF  =  -log F  (i.e.,  with  a  minus  sign)  in  place  of  F.  These  are  because  the  amplitudes  of 
the  weights  can  vary  over  many  orders  of  magnitude. 

Wang-Landau  algorithm: 

l.lnitialize  LP(X  .)  and  LF\  set  other  parameters. 

-  Set  LP(Xj)  =  0  for  j  =  1 . B. 

-  Set  LF  >  0  (e.g.,  LF  =  — log(l/e)  =  1). 

-  Set  the  maximum  number  of  iterations  Kmax  (e.g.,  Kmax  =  18). 

-  Set  the  counter  of  iterations  K  to  0. 

2.  Initialize  a  prototype  set  Y  and  its  histogram  H. 

-  If  K  >  Kmax,  end. 

-  Make  the  prototype  set  Y  empty. 

-  Set  h(Xj)  =  0  forj  =  1,  .  .  .  ,  B. 

3.  Selection  of  an  instance  from  X  into  Y. 

-  Randomly  sample  x  from  X  and  select  it  into  Y  in  probability  proportional  to 
x(X j)  =  exp(LP(X j))  subject  to  xeXj. 

4.  Modify  LP(X  .)  and  update  the  histogram  H. 

-  LP(Xj)  <—  LP(X  j)-LF  subject  to  xsXj. 

-  h(Xj)^h(Xj)  + 1  subject  to  xeXj. 

5.  Check  whether  H  is  "sufficiently  flat." 

-  If  so,  LF  <—  LF H,  K  =  K  +  1  and  go  to  Step  2. 

-  Otherwise  go  to  Step  3. 


The  application  of  this  Wang-Landau  algorithm  is  not  limited  to  derive  the  prototype 
set  Y  having  the  uniform  distribution.  By  changing  the  condition  at  the  step  5;  Check 
whether  H  is  "sufficiently  flat"  to  any  target  distribution,  this  algorithm  can  derive  Y  having 
the  distribution.  Accordingly,  this  is  also  applicable  to  Y  given  under  the  alpha-divergence 
measure,  if  Eq.(  15),  (16)  and  (17)  are  represented  by  some  analytical  forms  or  numerically 
solved. 


Experiment  and  Results:  We  applied  the  Wang-Landau  algorithm  with  the  MISPE 
measure  to  a  virtual  big  data  set  X  having  the  following  Weibull  distribution. 

k 

I1 

0  x  <  0, 


f(x;A,k)  =  < 


,-(xny 


x  >  0, 


where  A=1.0  and  k= 1.5.  This  distribution  is  known  as  a  typical  example  distribution  having  a 
significant  long  tail,  i.e.,  a  large  portion  of  the  probability  belongs  to  a  wide  range  of  the 
probability  variable  with  very  low  probability.  Therefore,  this  distribution  produces  a  data  set 
X  which  large  part  consists  of  rare  instances.  The  big  data  set  X  was  not  generated  at  the 
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step  3  of  the  pseudo-code  in  reality.  Rather,  the  direct  draw  of  each  instance  x 
from  f(x;A,k )  was  performed.  This  is  equivalent  to  have  an  infinite  data  set  X.  We  limited 
the  range  of  x  to  a  finite  interval  [0,5]  for  avoiding  computational  divergence.  The  initial 
values  of  the  logarithmic  weights  LP(X  =  log  7i{X  j )  and  the  logarithmic  factor 

LF  =  -logF  were  set  at  0  and  1,  respectively.  Kmax=18  was  used  for  the  number  of  the 
factor  updates  and  the  histogram  reconstructions. 

We  assessed  the  performance  of  our  proposed  method  by  checking  if  such  rare 
instances  are  efficiently  sampled  over  the  long  tail  of  the  distribution.  Figure  1  shows  the 
histogram  of  the  samples  in  the  prototype  set  Y  together  with  the  plot  of  the  Weibull 
distribution  having  A=1.0  and  k=1.5.  This  clearly  shows  that  Y  almost  uniformly  includes 
variety  of  samples  over  the  entire  range  of  x ,  and  the  rare  prototypes  in  the  range  [4,5] 
having  almost  negligible  probability  f(x)  form  a  significant  portion  of  Y.  The  similar  prototype 
set  can  be  obtained  by  accessing  the  entire  data  set  X  and  selectively  acquiring  the  rare 
prototypes.  However,  the  efficiency  of  this  approach  depends  on  the  sizes  of  X  and  Y.  For 
example,  if  the  size  of  given  X  is  160,000  and  the  required  size  of  Y  is  100,  its  efficiency  to 
obtatin  an  prototype  in  Y  is  100/160, 000=6. 25xl0'4.  This  efficiency  is  comparable  with  that 
of  our  proposed  prototype  sampling,  6.48xl0'4,  which  is  the  ratio  of  the  sample  population 
acquired  in  Y  over  the  total  number  of  the  draws  from  X,  /'.a,  f(x\A,k) .  On  the  other  hand, 
if  the  size  of  given  X  is  16,000,000  and  the  required  size  of  Y  is  100,  its  efficiency 
100/16, 000, 000=6. 25xl06  is  far  worse  than  the  efficiency  of  our  method.  In  other  words, 
our  approach  is  more  efficient  than  the  thorough  access  to  X  to  acquire  100  prototypes  in  Y, 
if  the  size  of  X  is  more  than  160,000.  This  comparison  demonstrates  the  advantage  of  our 
proposed  approach  for  the  prototype  sampling  from  a  big  data  set. 


Figure  1.  Weibull  distribution  f( x; A, k)  having  A=1.0  and  k=1.5  and  histogram  of  Y. 

Next,  we  compared  MISPE  and  MISE  between  our  approach  and  the  standard  random 
sampling.  In  our  approach,  the  probability  density  f(x; A, k)  of  the  Weibull  distribution  is 
estimated  by  using  Eq.(l)  as 
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where  gh(Xj)  and  7i{X j)  are  the  normalized  frequency  and  the  weight  of  the 

histogram  bin  Xjt  respectively.  In  the  standard  random  sampling,  f(x;l,k)  is  simply 

computed  by  the  histogram  of  the  randomly  sampled  prototype  set  Y.  Once  f(x;A,k)  is 
estimated  in  the  both  methods,  their  MIPSE  are  computed  by  Eq.(3),  and  their  MISE  are 
computed  by  the  standard  mean  integrated  square  error  between  f(x;A,k)  and  f(x; A, k) . 
Figure  2  show  the  comparison  of  their  MIPSE.  By  repeating  the  factor  updates  and  the 
histogram  reconstruction  in  our  approach,  its  MIPSE  got  far  smaller  than  that  of  the 
standard  random  sampling.  Figure  3  depicts  the  comparison  of  their  MISE.  Though  the  MISE 
of  our  approach  reduces  along  the  iterations,  it  does  not  reach  the  level  of  the  MISE  of  the 
standard  method.  These  results  are  consistent  with  our  theoretical  analysis  presented 
earlier. 

Discussion  and  Conclusion:  In  this  study,  we  have  theoretically  investigated  principles  of 
data-driven  probabilistic  modeling  to  capture  variety  of  the  instance  distribution  in  a  given 
data  set  for  covering  multiple  scopes  of  our  objective  system  in  a  seamless  manner.  We 
analyzed  two  error  measures,  MISPE  and  alpha-divergence,  to  derive  the  required 
distributions  of  instances  in  a  subsample  data  set  of  a  given  original  large  data  set,  and  we 
presented  a  mathematical  principle  to  derive  the  optimal  distributions  of  the  instances  in  the 
subsample  data  set  required  to  minimize  these  measures.  The  instances  in  these  subsample 
data  sets  are  expected  to  minimize  these  measures  and  preserve  the  varieties  of  the 
distribution  of  the  original  data  set  in  compact  fashions  by  the  nature  of  these  measures. 


12 


0 


0 
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Figure  2.  Change  of  MIPSE  over  the  iterations  of  the  factor  updates  and  the  histogram 
reconstructions  in  our  approach  and  comparison  with  MIPSE  provided  by  the  standard 
random  sampling. 
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Figure  3.  Change  of  MISE  over  the  iterations  of  the  factor  updates  and  the  histogram 
reconstructions  in  our  approach  and  comparison  with  MISE  provided  by  the  standard 
random  sampling. 

We  further  investigated  an  instance  weighing  and  sub-sampling  algorithm  named 
"Wang- Landau  Algorithm"  for  the  prototype  selection  preserving  the  variety  of  the  instance 
distribution  in  a  large  data  set.  This  enables  highly  tractable  probabilistic  modeling  of  the 
objective  system  over  its  multiple  scopes  by  efficiently  extracting  instances  representing 
each  scope  of  the  system  from  the  big  data. 

Finally,  we  evaluated  the  performance  of  the  developed  method  for  the  instance 
weighing  and  the  prototype  selection  through  some  simulation  examples.  The  achievement 
of  the  prototype  selection  preserving  the  variety  of  the  instance  distribution  in  a  large  data 
set  has  been  demonstrated  with  its  efficiency.  The  approach  can  be  used  for  the  efficient 
prototype  selection  from  the  big  data  set.  Moreover,  the  superior  accuracy  of  the  proposed 
prototype  selection  method  in  terms  of  the  MIPSE  measure  has  been  confirmed. 

An  issue  remained  for  future  work  is  the  implementation  of  the  prototype  selection 
algorithm  using  the  alpha-divergence  measure.  Since  the  distributions  of  the  prototypes 
meeting  with  this  measure  have  not  been  solved  in  analytical  manner  yet,  more  extensive 
studies  to  clarify  the  characteritics  of  the  solutions  and  to  develop  some  numerical  method 
for  their  implemenation  to  the  computational  algorithm  are  needed. 
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Appendix  A:  Analysis  on  Optimal  g(x)  for  Ml  SPE 

We  design  g(x)  to  minimize  the  error  measure  M/SP£[/(x|g(x),S,D)]  between  f(x)  and 
/(x | g(x),s,D)  .  Assume  the  support  of  f(x)  is  [x_min,  x_max]  where  /(x)  =  0  in  its 
outside.  Then  MlSPE^f[x\g[x),S,D]^  is  expressed  by 

MISPE[f(x\g(x),S,D)] 


=r 

=r 


px_ma: 
Jx  min 


_  .  Eg 

min  y 


f(x)-f(x\g(x),S,D) 

m 


f(x)-f(x\g(x),S,D) 


dx 


dx 


x_max  1 

7m’ 

Jx-maxZ_  (x,g(x)) dx, 

Jx  min 


m 

(f(x)-f(x\g(x),S,D)J  dx 


(20) 


where  L(x,g(x))  = 


f2(x)  9 


(/M-/(x|g(x),S,D)) 


.  According  to  Eg.  (9),  L  (x,g(x))  is  set 


as  L  (x,g(x))  =  L  (x,g(x))  + Ag(x) .  By  the  constraint  f*  maxg(x)dx  =  l , 

Jx_min 

/W/SP£[/(x|g(x),S,D)]  is  written  by 

MISPe\ f(x\g(x),S,D)~\=  r~max L  (x,g(x))dx,  s.t.  fx_maxg(x)c/x  =  l. 

L  J  Jx_min  Jx_min 


(21) 


Our  target  is  to  obtain  the  optimal  g(x)  by  minimizing 


MISPE 


f(x\g(x),S,D) 


as 


g  t(x)  =  argmm/V7/SP£r/(x|g(x),5,D)-|  =  argmin  ("*  max/.  (x,g(x))dx,  s.t.  f*  maxg(x)dx  =  l,  (22) 

g(x)  L  -I  g(x)  Jx_min  Jx_min 

where  gopt(x)  is  the  optimal  solution.  To  achieve  this  goal,  the  calculus  of  variation  is 


applied.  If  we  want  to  obtain  the  optimal  g(x)  by  minimizing 


MISPE 


f(x\g(x),S,D) 


then  this  problem  is  called  calculus  of  variation  with  integral  constraint.  In  our  problem, 

y(x)=g(x)  and  there  is  no  g'(x).  Thus  Eq.  (11)  is  reduced  to  dL/dy(x )  =  Owhich  is  Eq.  (23). 
The  optimal  g(x)  should  be  obtained  by  solving  the  following  equations  by 


dL  (x,g(x))  _8L  ( x,g{x ))  |  ^ 
■  dg(x )  dg(x ) 

px_max 

g(x)dx  =  1. 

Jx_min 


8E„ 


(f(x)-f(x\g(x),S,D)]j 


fix) 


dg[x) 


+  A,  —  0 


(23) 


Eq.  (23)  is  written  in  another  form  by 
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1  n  f  I  V  )  (  y  —  X  ^  ~ 

f(x\g(x),S,D)=-Yj^A^K  ,  X,^g(x)  (27) 

niigiX,)  y  h  J 

where  Xi,i  =  l,---,n  are  samples  from  the  weighted  sample  set  D  .  The  expectation  of 
f(x\g(x),s,D)  is 


dEg  f[x\g[x),S,D ) 

So  we  obtain  - — — - =  0 ,  because  the  expectation  of  f{x\g[x),s,D)  does  not 

<g(x) 

contain  g(x)  term. 
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,[/2(x|g(x),S,D)]  =  -^£g 


zz 

/= 1  J=1 


g(X,)  h  g(X.) 


=  "yEg 
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g(x,)  i 


y 
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E 

/ 

h2 

Q 
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f(Xi)K(X~Xl) 
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-r 
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g(x) 
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h2 

f[y)K 
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(29) 


where  /c(t)  is  assumed  to  be  symmetric  and  crf  =  fx  max/c2  (t)c/f  .  Then  we  obtain 

Jx  min  7 


dE„ 


f2(x\g(x),S,D) 


8g(x) 


by 


5fg[/  (x|g(x),5,D)  ncrt  /2(x) 


8g(x) 


h  g2(x) 


(30) 


d£„ 


Then 


(/(x)-/(x  |  g(x),S,D)j2 


dg{x) 

dE„ 


in  Eq.  (25)  is  written  by 


(/(x)  -  /(x  |  g(x),  S,  D)  ^ 


dg(x) 

3fg[/2(x|g(x),S,D)]  d£g[/(x|g(x),S,D)] 


5g(x) 


5g(x) 


nat  f2(x) 


h  g2(x) 

The  expression  of  g(x)  is  obtained  by 


=  -Af2(x). 


9M 

Ah 


(31) 


(32) 
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(33) 


(34) 


(35) 


(36) 


Appendix  B:  Analysis  on  Optimal  g(x)  for  a  -divergence 

Let  alpha-divergence  Dta,(/(x)  I  f(x\ g(x),S,D))  be  an  error  measure  M  (g,S,D) 
between /(x)  and  /(x I g(x), 5, D) .  In  our  analysis,  we  want  to  design  g(x)  to  minimize 
this  divergence  measure.  The  original  definition  of  the  alpha-divergence  is 

f  ( fa(x)fi-a(.x\g(x),X,Y)-ctf(x)  +  (a-\)f(x\g(x),X,Y))dx  /oyx 

Dia)  (f(x)  I  f(x  I  g(x),X,Y))  =  - - - - - ,  a  *  0,1. 

a(a-l) 

The  case  with  a —>■  0  or  1  is  defined  as  KL-divergence,  which  is  given  by 

limD(a>(/(x)l  /(xl  g(x),X,Y))  =  KL(f(x\g(x),X,Y)\f(x)),  (38) 

a—>0 

lim  D[a\f(x)  I  f(x  I  g(x),X,Y))  =  KL(f(x)  I  /(x  I  g(x),X,Y)) .  (39) 

a—>l 

Eq.  (37)  can  be  reformulated  to  a  simple  expression  by 

.  .  ~  1+f  fa(x)fl  a(x\  g(x),X,Y)dx 

D{  \f  (x)  I  fix  I  gix),  X,Y ))  =  - - - - - ,  «  *  0,1,  fix)  *  0.  (40) 

aia  - 1) 

In  Eq.  (40),  /(x)  causes  singularity  of  alpha  divergence  when  /(x)  =  0 .  Therefore,  without 
loss  of  generality,  we  exclude  the  area  of  /(x)  =  0  from  the  integral  and  assume  /(x)^0  in 

the  following  analysis,  /(x)  has  the  similar  effect  on  the  alpha  divergence,  and  we  assume 

that  f(x)^  0  whenever  /(x)^0. 

1.  For  the  case  a  ^  0,1 

Since  /(x|g(x),5,D)  is  a  function  having  a  probability  distribution  according  to  the  statistics 
of  D  sampled  from  5  and  D  follows  g(x),  we  take  the  expectation  of  alpha-divergence  to 
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measure  the  difference  between  f(x)  and  f(x\g(x),S,D)  over  g(x)  as 
Eg[DM(f(x)\f(x\g(x),S,D))] 

1 


1- 


[  (f a  ( x )  f 1 a  ( *  |  g  ( x ) ,  5 ,  D )  )dx  J 
\l[fa{x)Eq[f1-a{x\g{x),S,D)\lx\ 


By  substituting  Eq.  (41)  into  Eq.  (42),  the  following  expression  is  derived 

gopt(x)  =  argminf  (L  (x,g(x)))dx  s.t.  f  g(x)dx  =  l, 

q(x)  J- co  J-co 


(41) 


1 

a(l-a) 


Our  research  target  is  to  obtain  the  optimal  g(x)  to  minimize 
Eg[DM(f(x)\f(x\g(x),X,Y))]  by 

gopt{x)  =  argminEg[DM(f{x)\f(x\g[x),S,m  s.t.  V  g{x)dx=l  (42) 


(43) 


where  L  (x,g(x))  =  — —fa(x)Eg^f  a(x\g(x),S,D)j  .  According  to  Eq.  (9),  L(x,g(x)) 

is  set  as  L  (x,g(x))  =  L  (x,g(x))  +  Ag(x) .  Correspondingly,  Eq.  (10)  is  written  by 
dL[x,g(x))  8L(x,g(x )) 


8g(x ) 


-+A 

dg(x) 

1  8Eg\f1-a(x\g(x)lS,D)] 

fa  (x)— L  J 


(44) 


«(!-«) 


8g(x) 


poo 

+  2  =  0  s.t.  g(x)dx  =  l. 

J  —00 


Then 


^[Z1  “(x|g(x),S,D)] 


8g(x) 


is  written  by 


oEg  [Z1  “(x|g(x),S,D) 


<3g(x) 


=  cc(l-a)Af-a(x). 


(45) 


Here,  we  consider  to  employ  kernel  estimator  to  estimate  /(x|  g[x),s,D)  by 

f(x  |  g(x),  S,  D)  =  I X  )K(^4, 

hti  h 


(46) 


where  Xi,i  =  l,---,n  are  samples  in  the  weighted  sample  set  D  and  g(x)  should  be 
g(x)* 0  whenever  f(x)*0.  w(x)  is  unknown  since  /(x)  is  not  given  in  practice.  However, 
we  assume  that  w(x)  is  given  by  Eq.(ll)  from  /(x)  and  g(x)  in  our  current  problem  setting  as 
noted  earlier  in  the  general  problem  description.  Now,  we  want  to  calculate 
£9[Z1““(*lgM,S,D)],  which  is  written  by 
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Eg  [/l  a  (x  l  0M/S,D)] 


/  ~  \  l-« 

('in  v_  y  \ 


v<>t ; 

( 


If 

KhtigiX,)  h 


V“ 


(47) 


9(/i  '■■■,y„)dy1---dyn. 


Since  xl,  --,Xn  are  i.i.d.  sampled  from  g(x) ,  we  take  the  form  g(y1,---,yn)=g(y1)---g(yn) 
Eq.  (47)  is  written  by 


/1^(x|g(x),S,D) 


\l-a 


(48) 


t=*=Vi- 

'  *  hn 


l  r™  f00  XT' 

®  J  —oo  J  —oo 


Z^^if 
S'Btx-Zit,)  J 


fjg(x-/7t;)c/t1  ■■■dtn. 


y=i 


Set  s(x/f)  =  s(x/[tl/--,tn]r)=  ±^+K(ti) 


,1-cc 


\  i~i  g(x-ht ,) 

multivariate  to  s(x,f) .  Then,  s(x,t)  is  written  by 

^'i-ZZ^nW 

y=o  i/»i=y  P  ■ 


.  By  applying  Taylor's  expansion  for 


_^y  1  aIAIS(x,t) 

-otf- 

1  5l/?ls(x,f) 


(f-oK 


(49) 


(t-0)* 


=  5(X,0)  +  X^|  t„ 


tr  dt, 


for  the  first  order  approximation,  where  p  =  (/71#- ••,/?„) ,  1^1=^  +  •••  +  /?„ ,  /?!  =  &  !•■•/?„  I, 
and  (f)^  =(f1)/?1  ,  in  which  the  term  s(x,f)  at  the  point  t  =  0  is  expressed  by 


^-t)|,.=fzZr£rT«t-) 

lf-°  [figix-ht,) 


\  1-a 

V 

t=0 

l  gM  J 

(50) 


Also,  the  term  5s(x,t) 

dt, 


is  written  by 


3s(x,  f ) 

5t,. 


(l-«) 


Z 

V'=1 


fix-ht,) 
g(x-ht ,) 


K(t,) 


-h 

(f(x~htf 

y 

vg(x-f?f,.)y 

/C(t,.)  + 


f{x~ht\>(t) 
g(x-ht ,) 


(51) 
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Then  at  the  point  t  =  0,  Eq.  (27)  is  expressed  by 


ds(x,t) 


dt, 


=  (!  —  «) 


v  g(x) 


m 


-h 


kqMj 


K(0)  +  ^-K'(0) 

g(x) 


(52) 


By  substituting  Eq.  (49),  Eq.  (50)  and  Eq.  (52)  into  Eq.(48),  we  obtain 
c,[/1_e(x|g(x),s,D)] 


+  ti\Y\9(x-hti)dt1--dtn 


■A  ds(x,t) 


=?  5t,-  L 


(53) 


h"  Tpds{x,t) 


h  ati  8t, 


(1-«)|  n^MO) 
g{x) 


-mMW)A101 
g(*)J  g  (x) 


(-*(0)1 

-fM 

-a- 1 

f— ]' 

*’( o)  /M 

U  J 

_g(x) 

UmJ 

hK( 0)  g(x) 

nMm] 

h  {  g(x)  J  h"  h1-" 


From  Eq.  (53),  the  partial  derivative  of  Eg  [/1_a(x|sf(x),S,D)j  with  respect  of  g(x) 
by 

8Eg[fla(x\gM,S,D)] 

eg(x) 


is  given 


-m 


fa(x)  l-a 


h  (  h  )  o  0+1  (x)  h 


/m'|  no)  m 

g(x)  j  hK( 0)  g(x) 


+fM 

1 3(->f) 


af  n  f 

hih 


f'(x)  l2f(x)g'(x)  |  r(0)  /(x) 


02M  g{x)g2(x)  hK(o)g2(x) 


1  a(x  e 

7(*)j 

f'M  l 

fix)  g'(x)  l 

g'lx) 

aK'(  0)  1  ] 

J  U mJ 

g(x)  a  9  {  f(x)  l 

g(x)  f(x) 

g(x)  g(x)  fix) 

g2  M 

h  K( 0)  g(x)  J 

i  (7(*)\ 
/(x)UwJ 


hU  j  lg(x) 


-a  /  .  \-« 


J _ l-g/j.  r  ryij  a  ( fM)  ,  g'M  a  «"(0)  1  1 

g(x)  a  ’  92M  /i/f(0)g(x)j 


By  substituting  Eq.  (30)  into  Eq.  (21),  the  following  equation  is  derived  by 

-f-M0)TT^r[— (x-F5[X]){— f 1  =  a{l-a»r(x). 

h{h  J  {g(x)J  g(x)  a  V  9  ;  /(x)U*)J  g2M  h  K(0)  g(x)  1 


(54) 

(55) 


Eq.  (55)  is  rewritten  by 


g'(x)=- 


a 


a- 1 


1  +  /'(x)  1  K'(0) 


a-l(x-Eg[X])  f(x)  h  K( 0) 


g(x)+  a  .  ^  9-a+2(x).  (56) 

a-l(x-Eg[X])Vh  J 


where  f(x)  is  assumed  to  satisfy  f[x)* 0  and  Eq.  (56)  doesn't  involve  any  singularity  by 


this  assumption.  Since 


f'(x) 

m 


is  included  as  the  coefficient  of  g(x) ,  Eq.  (56)  is  referred  as 


ordinary  differential  equation(ODE)  with  variable  coefficient.  When  a t2,  this  special  form 
of  Eq.  (32)  is  called  as  Bernoulli  equation  and  that  has  general  solution  as  shown  later  in 
Theorem  1.  Besides,  when  a  =  2,  Bernoulli  equation  in  Eq.  (56)  is  changed  to  a  first  order 
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differential  equation  as  shown  later  in  Theorem  2.  Thus,  the  optimal  solution  of  g[x )  is 
discussed  in  the  following  selection  of  «  : 

(1)  When  <7^2  ,  a  Bernoulli  equation  is  analyzed. 

(2)  When  a -2,  a  first  order  differential  equation  is  analyzed. 

( 1)  For  the  case  a  *2 

Theorem  1.  The  Bernoulli  equation  is  given  by 

p'(x)  =  a1(x)p(x)+a2(x)p/i(x),  0*0,1,  (57) 

where  f3  can  be  any  real  number  other  than  0  or  1.  The  general  solution  in  [4]  is  given  by 
p1~/i(x)  =  (l-/3)eu{x) je~uMa2(x)dx,  where  u(x)  =  (l-j3)ja1(x)dx,  (58) 

where  the  function  exp[u(x)]  is  called  as  an  integrating  factor. 

Proof. 

Set  co(x)  =  p1^  (x)  for  changing  the  Eq.  (57)  to  a  general  form,  the  derivative  of  co(x)  is 

(o'[x)  =  (1  -  0)pp  (x)p'(x) 

=  (1  -  P)p'P  (x)  [a,  (x)p(x)  +  a2  [x)pp  (x)]  (59) 

=  (1  -  P)ai  (x)ffl(x)  +  (1  -  P)a2  (x). 


which  leads  to  nonhomogeneous  first  order  differential  equation.  Eq.  (59)  can  be  written  in 
another  form  by 

co'(x)-(l-0)a1(x)co(x)  =  (l-0)a2(x).  (60) 


A  new  function  q(x)  is  introduced  to  Eq.  (60)  to  make  the  left  hand  side  of  Eq.  (60)  have 

^  N  t 

co(x) 


the  form  like 


q(x) 


q(x) .  Eq.  (60)  is  rewritten  by 


t o(x ) 

~q[x) 


co'[x)q[x)  -  (1  -  [3)a1  (x)a>(x)q(x)  =  ^  _  a2  (x) 


ql(x) 


q(x) 


(61) 


By  comparing  with 


^«Xx)V 
v  q(x)  j 


a>'[x)q[x)-co[x)q'[x) 


,  the  following  equation  is  derived 


q'(x)  =  (1- 0)a1(x)q(x), 


(62) 


where  q(x)  is  solved  by 


q(x)  =  exp  (1  -0)ja1(x)dx1\i. 


(63) 


Moreover,  Eq.  (61)  provides  another  form  as  follows  by  q(x)  of  Eq.  (63) 

,o2(x) 


(  co[x) '' 


v  q(x)  j 


=(1-P) 


q(x) 


(64) 


Then,  co(x)  can  be  solved  by 


co(x)  =  (1  -  0)q[x)  f  dx. 

J  q(x) 


(65) 
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By  substituting  Eq.  (63)  into  Eq.(65),  the  following  equation  is  derived 

(y(x)  =  (l-/?)exp(u(x))Jo2(x)exp(-u(x))c/x,/?^0,l  where  u(x)  =  (l-j3)^a1(x)dx,  (66) 

which  is  the  general  solution  of  Eq.(57).  □ 

We  employ  the  Theorem  1  to  obtain  the  solution  Eq.  (32).  Corresponding  to  the  definition 
formulas  of  Bernoulli  equation  in  Theorem  1,  we  have 


°i  (x)  = 


a 


a- 1 


/'(x)  1K'(0) 


a 


-1  (x-Eg[X])  f(x)  h  K( 0) 


o2(x)  = 


a 


Ah 


a- 


l(x-fg[X])U 


~K(  0) 


and 


j3--a  +  2  where  2  in  Eq.(55).  Note  that  /?  =  1  is  automatically  excluded,  since 
a  ^  0,1  originally  hold.  The  expression  of  u(x)  is  obtained  by 
u(x)  =  (l-j3)^a1(x)dx 


=  (l  +  cr-2)— [ 
a-  1J 


-  + 


a- 


f'M  1K'(0) 
l(x-Eg[X])  '  f(x)  h  K( 0) 

a  /C'(0) 


dx 


=aln(/M)+-^.n(*-f,[Xl)-|— Jdx. 


(67) 


The  general  solution  of  Eq.  (56)  is  given  by 

n 

Ch 


g a_1  (X)  =  (^K(O) ]  exp (u(x) ) J ^i^-dx, 


(68) 


where  u(x)  =  crln(/(x))  +  01  ln(x-E  [X])-— [dx  and  r6xP(  u^)c/x  js  an 
V  '  a-1  V  9  '  H(0)J  J(x-fg[X]) 

indefinite  integral,  which  is  a  function  of  x.  A  is  Lagrange  multiplier.  To  express  g(x) ,  Eq. 
(68)  is  rewritten  by 

poo 

g(x)dx  = 

J  —00 


aAh 


7  n  ^ 

-m 

h 


i 

«-i 


"•■-"liSw" 


1 

a-1 


dx  =  1.  (69) 


7  n  ^ 

Then,  aXh  —  K(0) 
\h 

aAh 


is  expressed  by 


n 


-m\  = 

\h 


exp 


(“M)J 


exp(-u(x)) 

(x-f,IX]) 


dx 


dx 


(70) 


By  substituting  Eq.(70)  into  Eq.(68),  ga  1[x)  is  given  by 
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flfa_1W  = 


exp(u(x,)/x^i^iydx 


,  a  ^2. 


(71) 


'  exp(-t/(x)) 
{x~Eg[X]) 


exp(u(x))  J  „  '~v — Vi -dx 


a- 1 


dx 


a- 1 


Eq.(71)  is  written  in  another  form  as 

1 

[exp(t/(x))<7(x)]«-i 

/•  °o  |—  — .  ^ 

J  |^exp  (  l/(x))  .^(x)  1  dx 


a*  2, 


(72) 


where  tv(x)  =  crln(/(x))+  g  ^ln(x-Eg[X]) 

(2)  For  the  case  a  =  2 


a  K'( 0) 
h  K( 0)  • 


dx 


and  d(x)  =  J 


exp(-tv(x)) 

(x-Fs[X]) 


Theorem  2.  If  the  equations  have  the  form 

p'(x)  =  a1(x)p(x)  +  a2(x),  (73) 


which  is  called  as  first  order  differential  equation,  the  general  solution  is  given  by 

p(x)  -  exp(u(x))  j  exp(-u(x))o2(x)dx,  (74) 


where  u(x)  =  Jo^xjdx  and  the  function  exp[u(x)]  is  called  as  an  integrating  factor. 

Proof. 

Eq.  (73)  is  rewritten  as 

p'(x)-o1(x)p(x)  =  o2(x).  (75) 


We  assume  the  existence  of  a  new  function  q(x) .  Both  sides  of  Eq.(75)  are  multiplied  by 
q(x) ,  which  is  given  by 


p'Mq(x)  -  a1  (x)g(x)p(x)  =  o2  (x)q(x). 

(76) 

We  assume  q(x)  satisfy  the  following 

-a1(x)q(x)  =  q'(x). 

(77) 

So  Eq.(76)  is  rewritten  by 

(p(x)q(x))'  =a2(x)q(x). 

(78) 

From  Eq. (78),  p(x)  is  obtained  by 

1  (• 

p(x)=  a2[x)q[x)dx. 

q[x)J 

(79) 

Since  q[x)  should  satisfy  Eq. (77),  q(x)  can  be  solved  by 

q[x)  =  exp|-J  o1(x)dxj. 

(80) 

By  substituting  Eq.(80)  into  Eq.(79),  p(x)  is  obtained  by 
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p(x)  =  exp|J  a:(x)dxj  Jexp|-J  a1(x)dx\a2(x)dx, 


which  can  be  written  in  another  form  by 

p(x)  =  exp(u(x))|exp(-u(x))o2(x)dx, 

where  u(x)  =  Jo^xjdx . 

When  a  =  2 ,  the  expression  of  Eq.  (56)  is  changed  to 

„  ,  „  1  /'(x)  1  K'( 0)  .  .  2 Ah  (n 

(x-Eg[X])  /(x)  hK(  0)J  (x-Eg[X]){h 


Corresponding  to  the  definition  formulas  in  Theorem  2,  we  have 

.  .  J  1  f'(x)  1  If '(0)1  ,  ,  2/1/7  (n  Y 

aAx)  =  2  - r  +  - —  ,  aAx)  =  7 - —  —  K  0  .  The  expression  of 

|_(x-fB[X])  /(x)  />K(0)J  (x-Efl[X])U  J 

u(x)  is  obtained  by  substituting  o^x)  into  Ja^xjdx  by 
u(x)  =  J  o1  (x)c/x 


=  2f, 

J  (x-f9[X]j  /(x)  /iX(0) 


=  2ln(/(x))  +  2ln(x-fg[X])- 


2  X'(0) 
h  K( 0)  • 


By  applying  Theorem  2,  the  solution  of  g(x)  is  given  by 


g(x)  =  2Ah( —K(0)\  exp(u(x))  ^dx, 

U  J  V  ^  (X-fg[X]) 


,  ~  \  /  \  2  X'(0)  r  r°° 

where  u(x)  =  2ln  x-£„[X]  +2ln(/(x)) - dx  .  By  the  constraint  g(x)dx  =  1  and 

v  9  ’  y  J  h  K( 0) J 

/•oo 

Eq.(84),  g(x)dx  is  written  by 

•I  —co 

I  g(x)dx  -  2Ah  -K(0)  P  exp(u(x))f  ' " /x  dx  =  l,  (86) 

J-a.  J  J  (x-£g[X]J 


From  Eq. (86),  2Ah  —K( 0)  is  expressed  by 

l  h  J 


2A/7  -X(0) 

l  h 


£|exp(u(x))J^t^x 


By  substituting  2/Ld  —K( 0)  into  (85),  g(x)  is  written  by 

\h 
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gM 


exp(ll(x))c>(x) 


J  [exp(u(x))d’(x)]dx 

where  u(x)  =  2ln(/(x))  +  2ln(x-f Q[X])--^-^ fdx  and  d(x)  =  fexp^  U^}dx. 

V  ’  V  9  ’  h  K(0)1  J  (x-fJX]) 


2.  For  the  case  a  — >  0  or  1 


If  a  — >0  or  1,  Eq.  (4)  is  known  to  converge  to  KL-divergence  given  by 


KL  (/(x)  □  /(x  |  g(x),  S,  D))  =  J  /(x)  In 


/(x) 


/(x  |  g[x),S,D) 


dx 


poo  pOO  A 

[  /(x)ln/(x)dx- [  f(x)\r\f(x\g(x),S,D)dx. 

J  —oo  J  —oo 


(88) 


(89) 


The  expectation  of  /<l(/(x)n/(x|g(x),S,D))  is  applied  to  measure  the  difference  between 
f(x)  and f(x\g(x),S,D)  over  g(x)  as 

Eg  KL^f(x)Uf(x  |  g(x),5,D))J  =  |  /(x)ln/(x)dx- J  ^  f(x)Eg  [in /(x  |  g(x),S,D)]dx.  (gQ) 


Our  research  target  is  to  obtain  the  optimal  g(x)  to  minimize  E 


/C/.(/(x)D/(x|g(x),S;D)) 


as 


3  opt  (x)  =  argmin£0 


g(») 


/0.(/(x)Ci/(x|g(x),5,D))  s.f.  j  g(x)dx  =  1.  (91) 


By  substituting  Eq.(89)  into  Eq.(90),  the  following  expression  is  derived 
g  (x)  =  argminf  (L  (x,g(x)))dx  s.f.  f  g[x)dx  =  l, 

alx)  J-oo  J  —oo 


(92) 


where  L  [x,g[x))  =  -f(x)Eg  [ln/(x  |g(x),S,D)]  .  According  to  Eq.  (9),  L  (x,g(x))  is  set  as 

L  (x,g(x))  =  L  (x,g(x))  +  Ag(x) .  Correspondingly,  Eq.  (11)  is  written  by 
dL  (x,g(x))  dL  (x,g(x)) 


-X 


8g(x)  8g(x) 

8Eg  [ln/(x|  g(x),S,D)] 


(93) 


=  -/(*)- 


5g(x) 


+  X  = 


poo 

0  s.f.  g(x)dx  =  l. 

J  —00 


8Eg  T ln/(x  |  g(x),S,D)l 
Then  - - - 4  js  written  by 


8g[x) 


8Eg  ln/(x | g(x),S,D) 


8g(x )  /(x) 

Here,  we  consider  to  employ  kernel  estimator  to  estimate  /(x|g(x),s,D)  by 


(94) 
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(95) 


f(x  I  g(x),  s,  d)  =  ^  Z  w/(x, 

hti  h 


where  Xiti -1,  - ,n  are  samples  in  the  weighted  sample  set  D  and  g(x)  should  be 
g(x)^ 0  whenever  /(x)^ 0.  w[x)  is  unknown  since  f{x)  is  not  given  in  practice.  However, 
we  assume  that  w[x)  is  given  by  Eq.(ll)  from  /fx)  and  g(x )  in  our  current  problem  setting  as 
noted  earlier  in  the  general  problem  description.  Now,  we  want  to  calculate 
fg  [ln/(x  I  g(x),S,D)],  which  is  written  by 

£9[ln/(x|g(x),S,D)] 


1  n 
-L 


.x-X, 


=  £jin| 

Vht?  h  j 

htigix,)  h 


(96) 


g{y1,---,y„)dy1---dyn. 


Since  xl,  --,Xn  are  i.i.d.  sampled  from  g(x) ,  we  take  the  form  g(y1,--,yn)=g(y1)--g(yn) 
Eg. (96)  is  written  by 


,  ln/(x|g(x),5,D)] 


<»oo  <»oo 

In 

J  —00  J  —00 

l  h^g(y,)  h  J 

(97) 


/  y=i 


=x-Yi 

h 


) ; 

y  y=i 


Set  s(x/f)  =  s(x/[tl;-;tn]r)  =  ln  £ 'L^rS\K(tl) 

multivariate  to  s(x,f) .  Then,  s(x,f)  is  written  by 

V/?s(x,0) 


.  By  applying  Taylor's  expansion  for 


s(*/t)=ZZ 

y=o  i^h 

oo 

=zz 


polish/  /?! 

1  dw's(x,t) 


a -or 


UwP\ dt£-dtpn- 

1  dms[x,t ) 


(t-oK 


(98) 


[t-Of 


=  s(x,0)  +  X^M|  r„ 


«  at, 


for  the  first  order  approximation,  where  fl  =  ,fin) ,  \  fi\=  fi1+- ■  •  +  (3n ,  /?!  =  &!•■•/?„!, 
and  (f)^  =(f1)^1  -•(?„)" ,  in  which  the  term  s(x,t)  at  the  point  f  =  0  is  expressed  by 
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s(x,t)\  =ln  If  f(x-ht^K(t.))  =Jn-lMm 

{ htJ1g(x-hti )  'J  [hg(x) 


Also,  the  term  Gs^xitl  |s  written  by 

dt, 


i  g(x-ht,) 


&M)=  J/(x-H)  -4^-^)^)  +  ^-^)^) 

v ''  \  ' ''  \  1,1 


gr(x-/7t,.) 


g(x-hti) 


Then  at  the  point  t  =  0,  Eq.(lOO)  is  expressed  by 

<9t,  l  g(x)  J  lg(x)J  g(x) 


By  substituting  Eq.(98),  Eq.(99),  Eq.(lOO)  into  Eq.(lOl),  we  obtain 
Eg[\nf(x\g(x),s,D)] 

V  (=1  or,.  t=0  jj=1 

= h"six' °>£- £rW -htj)dt1--dtn+hnYJ £ •  •  •  £ t, n g<x  - - ht] ) dt,  ■  ■  ■  dtn 

7=1  i=l  °li  f=o  7=1 

,  J  "  /M  i  un  { (  *  /(X) 

i/ln\  Y\  J/wY  i/ln\  i  /(x) 


=  ln  £f£(0)  +W  n^O)  -A  ^  K(0)  +  ^(0)  -^(x-Eg[X]) 
lAg(x)  J  ^  g(x)  J  U(x)J  g(x)  h 


,  l  n  f(x)  ,  ,  |  (  ~  \  /'(x)  g'(x)  /C'(O) 

=  ln  -7-bT/c  o  -fx-f  JX  )  - — 

£g(x)  J  v  9  ;L/(x)  g(x)  hK( o) 


Then,  by  taking  the  derivative  of  Eq.(102),  we  obtain 
dEg  ln/(x|g(x),S,D)]  x 

dg[x)  g[x) 


—  -(x-Eg[X])^P- 

g(x)  1  3  g  (x) 


By  substituting  Eq.(103)  into  Eq.(93),  the  following  equation  is  derived  by 

■ -(x-Eg[X])^P-  =  — . 
g(x)  1  3  g  (x)  /(X) 


Eq.(102)  is  transformed  by 


,,  >  g(x)  21  2 

g(x)  =  -7 - r-r-7 - z  t - g  (x). 

(x-Eg[X])  (x-Eg[X])f(x) 


Eq.(103)  can  be  solved  by  Theorem  1.  We  have  o1(x)  =  - 


a?(x)  =  - 7 - — -  and  B  =  2  .  The  expression  of  u(x)  is  obtained  by 

(x-Eg[X])/(x) 
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(106) 


u{x)  =  {l-fi)ja1(x]dx 
1 


-f 


dx 


(x~Eg[X]) 
\n(x-Eg[X])+C, 


where  C  is  a  constant.  The  general  solution  of  Eq.(  105)  is 

g  1(x)  =  lexp(u(x))f^XP^  ^  -dx, 

J  (x-Eg[X])  f[x) 

where  i/(x)  =  ln(x-fg[X])+C  .  To  express  g(x) ,  Eq. (107)  is  rewritten  by 


g(x)=r1 


exp 


(»<*>)/ 


exp(-u(x))  1 
(x-fB[X]j  f(x) 


dx 


(107) 


(108) 


fOO 

By  the  constraint  g(x)dx  =  l  and  Eq. (108),  g(x)  must  satisfy  the  following  condition 

J  —oo 


|  g(x)dx  =  A  1 1 


exp(u(x))J 


exp(-u(x))  1 
(x-fg[X])  f(x) 


dx 


dx  =  1. 


(109) 


Then  A  1  is  expressed  by 

A'1 


exp 


(U(x))j 


exp(-u(x))  l 
(x-ffl[X])  f(x) 


dx 


dx 


By  substituting  Eq.(llO)  into  Eq.(108),  g(x)  is  written  by 

[exp(u(x))d(x)]  1 
g(x)=—^= - — — > 

j  [exp(u(x))d(x)  1  dx 


(110) 


(111) 


,exp(-u(x))  l  , 

](x-Eg[X])f(x) 
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